This repository has been archived by the owner on May 6, 2022. It is now read-only.

4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1765

Merged: 20 commits, Mar 24, 2018

Conversation

@nilebox (Contributor) commented Feb 26, 2018

Fixes #1715

This is the final (third) part, covering item 1 of the TODO list in #1715 (comment).
Item 2: #1751
Item 3: #1748

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2018
@staebler (Contributor) left a comment

I have one question and one concern.

  1. What is the desired behavior when orphan mitigation fails? In that case, are we going to consider the instance dead and unusable? Or are we going to open the instance up for request retries and processing spec updates?
  2. Retryable requests will restart the OperationStartTime when the orphan mitigation completes. This means that retryable requests involving orphan mitigation have the potential to retry indefinitely.

if shouldMitigateOrphan {
// TODO nilebox: if failedCond == nil, we lose the original error reason/message
// in the status completely there. Should we keep the original reason/message instead?
Contributor (@staebler):

I propose that we add a new OrphanMitigation condition that is true when the instance is undergoing orphan mitigation. The Ready condition is not changed when starting and stopping orphan mitigation. Presumably, the Ready condition will already be false when orphan mitigation starts. Although I'd also be fine with relying solely on OrphanMitigationInProgress instead of adding a new condition. Either way, I suggest that we stop changing the Ready condition with the starting-orphan-mitigation reason. Of course, this is all follow-on work.
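
A rough illustration of what such a condition could look like (hypothetical names, not code from this PR; it reuses the setServiceInstanceCondition helper visible elsewhere in this diff):

```go
package controller

import (
	"github.com/kubernetes-incubator/service-catalog/pkg/apis/servicecatalog/v1beta1"
)

// Hypothetical sketch only: a dedicated condition type that is set to True
// while orphan mitigation runs, deliberately leaving the Ready condition
// untouched. The constant name and reason string are assumptions.
const ServiceInstanceConditionOrphanMitigation v1beta1.ServiceInstanceConditionType = "OrphanMitigation"

func markOrphanMitigationStarted(instance *v1beta1.ServiceInstance) {
	// setServiceInstanceCondition is the existing helper used elsewhere in this diff.
	setServiceInstanceCondition(instance,
		ServiceInstanceConditionOrphanMitigation,
		v1beta1.ConditionTrue,
		"OrphanMitigationStarted",
		"Orphan mitigation is in progress")
	// The Ready condition is intentionally not modified here.
}
```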

Contributor Author (@nilebox):

@staebler I personally don't have an objection to an OrphanMitigation condition; we can discuss this separately with @pmorie and others.
As you suggest (and as my TODO suggests too), I will keep the original reason/message for now, and probably leave a TODO noting that it would be nice to reflect the started orphan mitigation in the status somehow.

Contributor Author (@nilebox):

Raised an issue #1771 to discuss this.

setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionFailed, failedCond.Status, failedCond.Reason, failedCond.Message)
errorMessage = fmt.Errorf(failedCond.Message)
} else {
resetServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionFailed)
Contributor (@staebler):

Why is resetting the Failed condition part of processing a failure? The Failed condition should be cleared out as part of starting a new provision rather than as part of finishing a provision. As it is, the Failed condition is not cleared out when the provision is successful.

Contributor Author (@nilebox):

The failedCond == nil here means that we have encountered a retriable error (see the processTemporaryProvisionFailure method).
The Failed:True condition along with the latest ObservedGeneration means: "terminal failure that doesn't need retrying until the spec is updated (and consequently the Generation is incremented)".

As a reminder from #1748:

+// isServiceInstanceProcessedAlready returns true if there is no further processing
+// needed for the instance based on ObservedGeneration
+func isServiceInstanceProcessedAlready(instance *v1beta1.ServiceInstance) bool {
+	return instance.Status.ObservedGeneration >= instance.Generation &&
+		(isServiceInstanceReady(instance) || isServiceInstanceFailed(instance)) && !instance.Status.OrphanMitigationInProgress
+}

isServiceInstanceProcessedAlready is invoked before we start a new provisioning operation, and it will return true unless we reset the Failed condition as I do here.

I don't like that this also means that we lose the reason and message there, but currently we store the same reason and message in the Ready condition, so it's probably not too bad.

Contributor Author (@nilebox):

@staebler actually sorry, you're right. While what I have described above is correct, we don't need to reset the Failed condition here, because it has already been reset at the beginning of this operation (assuming that #1748 is merged as-is before we merge this PR).
Will remove this code block.

is5XX
}

// isTerminalHttpStatus returns whether an error with the given HTTP status
Contributor Author (@nilebox):

nit: s/isTerminalHttpStatus/isRetriableHTTPStatus

// Need to vary return error depending on whether the worker should
// requeue this resource.
if failedCond == nil || shouldMitigateOrphan {
return errorMessage
Contributor Author (@nilebox):

This is not strictly required as we have updated the status above, so the instance will come back to the queue anyway.
Keeping it as it is, as it was written that way before.

Contributor Author (@nilebox):

Actually, this is needed, but the comment is misleading: we need to prevent resetting the rate limiter.
/cc @kibbles-n-bytes this is another place where we claim that we want to requeue the resource when what we really need is to prevent invoking the Forget() method.
I will update the comment.
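
For context, this is the standard client-go workqueue pattern the comment refers to (an illustrative worker loop, not the Service Catalog code itself): a non-nil error from the sync handler requeues the item with backoff, while a nil error calls Forget() and resets the per-item rate limiter.

```go
package controller

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem is an illustrative worker loop showing why the sync
// handler returns a non-nil error for retriable failures: the error path
// requeues with backoff and never calls Forget(), so the per-item rate
// limiter keeps its growing delay.
func processNextWorkItem(queue workqueue.RateLimitingInterface, sync func(key string) error) bool {
	item, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(item)

	key := item.(string)
	if err := sync(key); err != nil {
		// Non-nil error: keep the backoff state and retry later.
		queue.AddRateLimited(key)
		fmt.Printf("requeued %q after error: %v\n", key, err)
		return true
	}

	// Nil error: reset the rate limiter for this key.
	queue.Forget(key)
	return true
}
```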

@nilebox (Contributor Author) commented Feb 28, 2018

  1. What is the desired behavior when orphan mitigation fails? In that case, are we going to consider the instance dead and unusable? Or are we going to open the instance up for request retries and processing spec updates?

@staebler You mean when we have reached the retry timeout, i.e. reconciliationRetryDurationExceeded(instance.Status.OperationStartTime)? I think it should be unusable (but not dead), but could possibly be manually triggered again by updateRequests++?

The spec says that the platform SHOULD (i.e. not MUST) keep trying until it succeeds (nothing about time limits or max number of retries):

the Platform SHOULD attempt to delete resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the Service Broker responds with a success.

Also I love this OSB spec paragraph:

If the Platform encounters an internal error provisioning a Service Instance or Service Binding (for example, saving to the database fails), then it MUST at least send a single delete or unbind request to the Service Broker to prevent the creation of an orphan.

It sounds like "best effort but no guarantees whatsoever". Just brilliant.
It MUST send a delete request at least once, and SHOULD keep trying until it succeeds.

We can discuss this at the next SIG call.

@nilebox (Contributor Author) commented Feb 28, 2018

  2. Retryable requests will restart the OperationStartTime when the orphan mitigation completes. This means that retryable requests involving orphan mitigation have the potential to retry indefinitely.

@staebler Yes. Is this bad? That's how Kubernetes works - it keeps trying to deploy a Pod even if the Docker image doesn't exist in DockerHub. There might be some bug on the broker side leading to 500 Internal Server Error that can be fixed at any time, so it makes sense to keep retrying to provision.

I think there should be some monitoring tool that cleans up all the instances that have been stuck in provisioning for too long and/or notifies users (owners) about the issue (e.g. OpenShift and Atlassian will probably have something in-house in that area). But that's out of scope for Service Catalog.

If such behavior turns out to be painful, we can probably fix it later by introducing retry limits.

@nilebox (Contributor Author) commented Feb 28, 2018

I think it should be unusable (but not dead), but could possibly be manually triggered again by updateRequests++?

Also need to check whether the OrphanMitigationInProgress is kept after exceeding the retry timeout. I suspect that we reset it, and we might need an OrphanMitigation condition to be able to keep it.

@staebler (Contributor):

Retryable requests will restart the OperationStartTime when the orphan mitigation completes. This means that retryable requests involving orphan mitigation have the potential to retry indefinitely.

@staebler Yes. Is this bad? That's how Kubernetes works - it keeps trying to deploy a Pod even if the Docker image doesn't exist in DockerHub. There might be some bug on the broker side leading to 500 Internal Server Error that can be fixed at any time, so it makes sense to keep retrying to provision.

Service catalog specifically has the reconciliation retry duration that determines how long reconciliation of a given generation should be attempted before giving up. Let's say that the provision request failed repeatedly with 408 Request Timeout errors. By default, the controller would stop attempting to send provision requests to the broker after 7 days. At that point, the controller would set the instance to Failed:True. On the other hand, let's say that the provision request failed repeatedly with 500 Internal Server Error errors. In that case, the controller will perform orphan mitigation after each error. Each successful orphan mitigation will reset the OperationStartTime for the instance. Consequently, the controller will send provision requests indefinitely, beyond the 7-day limit. There should not be a difference in how long the controller attempts reconciliation between errors that require orphan mitigation and errors that do not.
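
For reference, a minimal sketch of how such a retry-window check typically works (assumed signature, not necessarily the actual reconciliationRetryDurationExceeded helper): the window is measured from OperationStartTime, so resetting that field after each orphan mitigation effectively restarts the 7-day clock.

```go
package controller

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Minimal sketch (assumed signature, not the actual controller helper): the
// retry window is measured from OperationStartTime, so resetting that field
// after each orphan mitigation effectively restarts the 7-day clock.
func reconciliationRetryDurationExceeded(operationStartTime *metav1.Time, retryDuration time.Duration) bool {
	if operationStartTime == nil {
		return false
	}
	return time.Now().After(operationStartTime.Add(retryDuration))
}
```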

@staebler (Contributor):

What is the desired behavior when orphan mitigation fails? In that case, are we going to consider the instance dead and unusable? Or are we going to open the instance up for request retries and processing spec updates?

@staebler You mean when we reached the retry timeout reconciliationRetryDurationExceeded(instance.Status.OperationStartTime)? I think it should be unusable (but not dead), but could possibly be manually triggered again by updateRequests++?

@nilebox Yes, that looks to be the only (non-constraint-breaking) reason why an orphan mitigation would fail.

I think we have some inconsistencies in how we are treating broker resources if we allow further provision requests after a failed orphan mitigation. If a pure delete fails (i.e., exceeds the reconciliation retry duration), then the instance is kept around so that the user may take manual action to negotiate with the broker to clean up resources. For a failed orphan mitigation, we would not be giving the user the same opportunity for manual intervention.

@nilebox (Contributor Author) commented Feb 28, 2018

Each successful orphan mitigation will reset the OperationStartTime for the instance. Consequently, the controller will send provision requests indefinitely, beyond the 7-day limit.

Ok, I see the problem now. The easiest solution then is not to invoke the recordStartOfServiceInstanceOperation() method in case of deletion (or at least for orphan mitigation)?

@nilebox (Contributor Author) commented Feb 28, 2018

For a failed orphan mitigation, we would not be giving the user the same opportunity for manual intervention.

@staebler I think we should keep the instance in the "unusable" state if the orphan mitigation retry limit is exceeded.
By "but could possibly be manually triggered again by updateRequests++" I meant that it should trigger a new cycle of orphan mitigation, not provisioning.

This would allow for manual intervention:

  1. Orphan mitigation fails until the retry limit is exceeded; we mark the status with the Failed:True condition and stop retrying.
  2. User reports this issue to the broker developer, and a bugfix is released. The broker is now ready to process the orphan mitigation correctly.
  3. User updates the spec (e.g. updateRequests++), and triggers the reconciliation loop. Service Catalog controller sees that there is unfinished orphan mitigation (#1771 (comment)), and starts it again.
  4. Orphan mitigation has successfully finished, and now there will be a new provisioning operation with updated spec.

To be clear, I understand that we'll need #1771 or some other improvement to make sure that we will not do any provisioning until the orphan mitigation has succeeded.
If the user decides to give up, they can delete the instance and create a new one.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 5, 2018
@kibbles-n-bytes (Contributor) commented Mar 8, 2018

@staebler @nilebox Perhaps this is a situation in which it'd be beneficial to use the LastTransitionTime of the new OrphanMitigation condition? From a purity standpoint, the "operation" that is in progress here is a Provision, so the operation's start time should not really be reset when we start orphan mitigation. However, we could gauge how long the orphan mitigation has been ongoing based on the last time our OrphanMitigation condition transitioned from False/Unknown (implicit when the condition is not present) to True.

Would complicate the logic of the controller slightly, but I think it'd get us the desired behavior.

@nilebox (Contributor Author) commented Mar 9, 2018

@kibbles-n-bytes your solution won't work in the case where we have exhausted the retry timeout but the spec was then updated (leading to Generation++), which should trigger another round of orphan mitigation with a new LastTransitionTime = Now. So we need to explicitly reset the "operation start time" every time we start a new round of orphan mitigation; a static time can't solve this problem.

@nilebox nilebox changed the title WIP: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) 4xx, 5xx and Connection timeout should be retriable (not terminal errors) Mar 9, 2018
@nilebox (Contributor Author) commented Mar 9, 2018

@kibbles-n-bytes @staebler how about adding a new operation type CurrentOperation = OrphanMitigation? This would make OperationStartTime reflect a single orphan mitigation operation (and we would also lose the start time of Provision operation, but that's probably fine). OrphanMitigation should always be followed by a new Provisioning operation, so there is no ambiguity.

Combined with the OrphanMitigation condition and OrphanMitigationInProgress flag it gets ridiculous though, so I'm not proposing this option seriously, just throwing out another idea.

@nilebox (Contributor Author) commented Mar 9, 2018

Actually, never mind, I lost the context of the discussion :(
As was pointed out before, we don't want to reset the OperationStartTime, to avoid retrying indefinitely.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 21, 2018
@nilebox (Contributor Author) commented Mar 21, 2018

/retest

@nilebox (Contributor Author) commented Mar 22, 2018

@staebler @kibbles-n-bytes rebased on the latest master after merging the other non-happy-path PRs; please review.

Each successful orphan mitigation will reset the OperationStartTime for the instance. Consequently, the controller will send provision requests indefinitely, beyond the 7-day limit.

@staebler we don't do that anymore (merged in the other PR) - we don't invoke recordStartOfServiceInstanceOperation for orphan mitigation anymore.

@nilebox (Contributor Author) commented Mar 22, 2018

/retest

// isRetriableHTTPStatus returns whether an error with the given HTTP status
// code is retriable.
func isRetriableHTTPStatus(statusCode int) bool {
return statusCode != 400
Contributor:

FYI there are constants for HTTP status codes in the http package
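
A possible form of the excerpt above using the net/http constant, as the reviewer suggests (illustrative only, not necessarily the final merged code):

```go
package controller

import "net/http"

// isRetriableHTTPStatus returns whether an error with the given HTTP status
// code is retriable. In this PR, only 400 Bad Request is treated as terminal.
func isRetriableHTTPStatus(statusCode int) bool {
	return statusCode != http.StatusBadRequest
}
```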

@nilebox (Contributor Author) left a comment

Leaving comments for myself


@@ -1662,12 +1681,14 @@ func (c *controller) processUpdateServiceInstanceSuccess(instance *v1beta1.Servi
// processUpdateServiceInstanceFailure handles the logging and updating of a
// ServiceInstance that hit a terminal failure during update reconciliation.
func (c *controller) processUpdateServiceInstanceFailure(instance *v1beta1.ServiceInstance, readyCond, failedCond *v1beta1.ServiceInstanceCondition) error {
// TODO nilebox: We need to distinguish terminal and temporary errors there
// but we need to merge https://github.com/kubernetes-incubator/service-catalog/pull/1748
Contributor Author (@nilebox):

#1748 is merged, so we can fix this TODO

@nilebox (Contributor Author) left a comment

Extra comments on PR to help reviewers

// Don't reset the current operation if the error is retriable
// or requires an orphan mitigation.
// Only reset the OSB operation status
clearServiceInstanceAsyncOsbOperation(instance)
Contributor Author (@nilebox):

@kibbles-n-bytes note that I am reusing your newly introduced method here and in the update failure handling too.
Please double-check that this is the correct behavior.
The goal here is to keep the OperationStartTime untouched to avoid retrying indefinitely.
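
To illustrate the intent described above (a sketch only; the field names are assumptions based on this thread, not the actual implementation of clearServiceInstanceAsyncOsbOperation):

```go
package controller

import (
	"github.com/kubernetes-incubator/service-catalog/pkg/apis/servicecatalog/v1beta1"
)

// Sketch of the intent only, not the actual clearServiceInstanceAsyncOsbOperation:
// clear just the async OSB operation bookkeeping while leaving CurrentOperation
// and OperationStartTime untouched, so the reconciliation retry window keeps
// counting. Field names are assumptions.
func clearServiceInstanceAsyncOsbOperationSketch(instance *v1beta1.ServiceInstance) {
	instance.Status.AsyncOpInProgress = false // assumed field
	instance.Status.LastOperation = nil       // assumed field
	// instance.Status.OperationStartTime is intentionally NOT reset here.
}
```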

Contributor (@arschles):

the idea behind this seems ok to me

@kibbles-n-bytes (Contributor) commented Mar 23, 2018

@nilebox As a consequence of this, it seems like we'll only try the deprovision once in the case of a final orphan mitigation after a reconciliation retry duration timeout during a provision. For now, I think it's fine, though we should revisit it in the future.

// But we still need to return a non-nil error for retriable errors and
// orphan mitigation to avoid resetting the rate limiter.
if failedCond == nil || shouldMitigateOrphan {
return errorMessage
Contributor Author (@nilebox):

See the comment above that clarifies why we need to return a non-nil error.

@arschles (Contributor) left a comment

Retries seem like a good idea to me, and my understanding is that they don't violate the OSB spec. Code looks fine too.

LGTM


@arschles arschles added the LGTM1 label Mar 23, 2018
@kibbles-n-bytes (Contributor) commented Mar 23, 2018

@nilebox Just to be sure, neither this nor #1789 add logic that re-triggers orphan mitigation after it has already failed, if someone edits the spec, correct? So resources that failed orphan mitigation are stuck in a permanent failed state. If so, I'm okay with that, and we can revisit it later.

@kibbles-n-bytes (Contributor):

LGTM. RIP everyone wishing to fix the resources after they've fat-fingered a parameter, though. 💀 Making 400 Bad Request a terminal failure will help with that, but retrying async operations could still get us in a sticky situation. I'm okay with it going in, but we gotta fix that ASAP. Until then, though, users can modify the default reconciliation retry duration via the --reconciliation-retry-duration parameter on the controller-manager binary.

@kibbles-n-bytes (Contributor):

Retriggering Jenkins just for peace of mind. If it's green, please merge.

@pmorie (Contributor) commented Mar 24, 2018

RIP everyone wishing to fix the resources after they've fat-fingered a parameter, though

Let's not introduce a regression here - people already have issues with recovering from a change to a bad parameter state.

Can you describe the situation you're talking about?

@nilebox (Contributor Author) commented Mar 24, 2018

Just to be sure, neither this nor #1789 add logic that re-triggers orphan mitigation after it has already failed, if someone edits the spec, correct? So resources that failed orphan mitigation are stuck in a permanent failed state.

@kibbles-n-bytes Nope. Orphan mitigation will be retried for a week (by default). And if the max retry duration is exceeded, we will stop retrying. But after the spec is updated, it will trigger orphan mitigation first, followed by reconciling a new spec (only after/if orphan mitigation has succeeded).

If you disagree with this behavior, or think that the code behaves differently, please comment.

@nilebox (Contributor Author) commented Mar 24, 2018

Let's not introduce a regression here - people already have issues with recovering from a change to a bad parameter state.

@pmorie just to be clear here: currently people have no way of recovering from a change to a bad parameter state at all.
This PR adds support for:

  1. retry loop after any failure except 400 Bad Request
  2. orphan mitigation and reconciliation after spec is updated.

So this alone is a huge improvement, and there cannot be a regression in behavior that was missing before.

On the other hand, and that's what @kibbles-n-bytes is referring to, the retry loop also makes #1755 more obviously annoying for the non-happy path: the spec will be locked for a week if the OSB broker continuously returns an error on provisioning (any error code except 400 Bad Request).

I definitely agree that we need to fix #1755 ASAP, and that's why I have been pushing for it for weeks now.

@nilebox (Contributor Author) commented Mar 24, 2018

So resources that failed orphan mitigation are stuck in a permanent failed state.

@kibbles-n-bytes ah sorry, you mean the ones that are already marked with the terminal Failed condition and an OrphanMitigationInProgress=true flag (before this change)? They won't automatically recover, yes, but after the spec is updated, it should trigger orphan mitigation (I think) followed by reconciling the new spec. There's no longer a dead-end terminal error state for instances, so we should be able to recover from any state after updating Service Catalog (apart from DeletionTimestamp != nil).

@duglin (Contributor) commented Mar 24, 2018

Let's be clear though: we could remove the lock for the "same user" to unblock most people and not have any concern about OSB API spec violations. That would be an easy fix (I think) that could probably be done (by someone who knows the code) in short order.

@nilebox (Contributor Author) commented Mar 24, 2018

@duglin what I meant to say is that the current behavior is unacceptable. One way or another, we need to fix it ASAP, in whatever way we all agree on.

@nilebox nilebox deleted the temporary-http-errors branch March 24, 2018 14:09
@duglin (Contributor) commented Mar 24, 2018

yes but my point is that the urgency of this can be GREATLY diminished if we would just unlock it for the same user. That would give us time to make the right design decision (for kube and osbapi compliance) and not make a rushed one.

I doubt this would take more than a few lines of code to do - it should just require a modification to the lock check to add something like request.UserInfo != instance.UserInfo

@@ -761,14 +769,12 @@ func (c *controller) pollServiceInstance(instance *v1beta1.ServiceInstance) erro
reason := errorProvisionCallFailedReason
message := "Provision call failed: " + description
readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, message)
failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, message)
err = c.processProvisionFailure(instance, readyCond, failedCond, false)
err = c.processTemporaryProvisionFailure(instance, readyCond, false)
Contributor (@n3wscott):

This introduced a bug: if the polling operation returns state=failed, then in combination with the change on this line: https://github.com/kubernetes-incubator/service-catalog/blob/e15b73719911853d3755b71f3d8b26b21296d0a3/pkg/controller/controller_instance.go#L1652

a failed provision after polling never attempts deprovision.

Contributor Author (@nilebox):

@n3wscott apologies, this is because I thought that for async operations we never need to perform orphan mitigation.
If what @kibbles-n-bytes wrote about bindings in openservicebrokerapi/servicebroker#334 (comment) is true for instances too, then we could just change the flag to true here:

err = c.processTemporaryProvisionFailure(instance, readyCond, true)

i.e. perform orphan mitigation immediately instead of waiting for deletion of the instance.
Thoughts?

@n3wscott also, can you create an issue to track this?

Contributor (@n3wscott):

I had: #1879

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2 non-happy-path size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
9 participants